Current Issue: April–June, Volume 2020, Issue 2, 5 Articles
Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) using a microphone sensor. Quantifiable emotion recognition from speech signals captured by such sensors is an emerging area of research in HCI, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the speaker's emotional state must be determined from an individual's speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to perform better. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than using pooling layers, and global discriminative features are learned in fully connected layers. A softmax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. This demonstrates the effectiveness and significance of the proposed SER technique and its applicability in real-world applications.
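The following is a minimal sketch of the stride-based down-sampling idea the abstract describes: convolutions with stride 2 shrink the feature maps instead of pooling layers, and fully connected layers plus a softmax produce the emotion prediction. The layer counts, filter sizes, input size (128x128 spectrogram patches), and number of emotion classes are illustrative assumptions, not the authors' exact DSCNN.

```python
# Sketch of a stride-based CNN for speech emotion recognition (assumed layout).
import torch
import torch.nn as nn

class StrideCNN(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        # Local patterns are learned by strided convolutions; each stride-2
        # layer halves the spatial resolution of the feature maps (128->64->32->16),
        # so no pooling layers are needed for down-sampling.
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Global discriminative features are learned in fully connected layers.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 256), nn.ReLU(),
            nn.Linear(256, n_classes),  # logits; softmax applied at inference
        )

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq_bins, time_frames), here assumed 128x128
        return self.classifier(self.features(spectrogram))

# Example: classify a batch of eight 128x128 log-spectrogram patches.
model = StrideCNN()
probs = torch.softmax(model(torch.randn(8, 1, 128, 128)), dim=1)
```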
Because one of the key issues in improving the performance of Speech Emotion Recognition (SER) systems is the choice of an effective feature representation, most research has focused on feature-level fusion of large feature sets. In our study, we propose a relatively low-dimensional feature set that combines three features: baseline Mel Frequency Cepstral Coefficients (MFCCs), MFCCs derived from Discrete Wavelet Transform (DWT) sub-band coefficients, denoted DMFCC, and pitch-based features. Moreover, the performance of the proposed feature extraction method is evaluated in clean conditions and in the presence of several real-world noises. Furthermore, conventional Machine Learning (ML) and Deep Learning (DL) classifiers are employed for comparison. The proposal is tested on speech utterances from both the Berlin German Emotional Database (EMO-DB) and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database in speaker-independent experiments. Experimental results show an improvement in speech emotion detection over the baselines.
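A possible reading of this fused feature vector is sketched below: baseline MFCCs, MFCC-style features computed from DWT sub-band coefficients (the "DMFCC" part), and simple pitch statistics, concatenated into one low-dimensional vector. The wavelet family ('db4'), decomposition level, coefficient counts, and pitch statistics are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of the MFCC + DMFCC + pitch feature fusion (assumed parameters).
import numpy as np
import librosa
import pywt

def extract_features(y: np.ndarray, sr: int) -> np.ndarray:
    # Baseline MFCCs, averaged over time to a fixed-length vector.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

    # DMFCC: MFCC-style features from each DWT sub-band signal.
    coeffs = pywt.wavedec(y, 'db4', level=3)
    dmfcc = np.concatenate([
        librosa.feature.mfcc(y=c, sr=sr, n_mfcc=13, n_fft=512).mean(axis=1)
        for c in coeffs
    ])

    # Pitch-based features: mean and standard deviation of the F0 contour.
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)
    pitch = np.array([np.mean(f0), np.std(f0)])

    return np.concatenate([mfcc, dmfcc, pitch])

# Toy usage on a 1-second 220 Hz test tone.
sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)
print(extract_features(y, sr).shape)
```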
The process of listening to an audiobook is usually a rather passive act that does not require active interaction. Incorporating spatial interaction into a storytelling scenario can open up possibilities for a novel experience in which active participation affects the user experience. The aim of this paper is to create a portable prototype system based on an embedded hardware platform, allowing listeners to become immersed in an interactive audio storytelling experience enhanced by dynamic binaural audio rendering. For the evaluation of the experience, a short story based on the horror narrative of Stephen King's Strawberry Springs is adapted and designed in virtual environments. A comparison among three different listening experiences, namely (i) monophonic (traditional audio story), (ii) static binaural rendering (state-of-the-art audio story), and (iii) our prototype, is conducted. We discuss the quality of the experience based on usability testing, physiological data, emotional assessments, and questionnaires for immersion and spatial presence. Results identify a clear trend toward increased immersion with our prototype compared to traditional audiobooks, and also show an emphasis on story-specific emotions, i.e., terror and fear.
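As a rough illustration of what dynamic binaural rendering involves, the sketch below convolves a mono story block with a head-related impulse response (HRIR) pair selected from the listener's tracked head orientation. The HRIR lookup table and the azimuth value standing in for a head-tracker read-out are hypothetical placeholders; the paper's rendering pipeline is not described at this level of detail.

```python
# Sketch of orientation-driven binaural rendering by HRIR convolution (assumed setup).
import numpy as np
from scipy.signal import fftconvolve

def render_binaural_block(mono_block: np.ndarray,
                          azimuth_deg: float,
                          hrir_bank: dict) -> np.ndarray:
    # Quantize the tracked azimuth to the nearest measured HRIR direction.
    nearest = min(hrir_bank, key=lambda a: abs(a - azimuth_deg))
    hrir_l, hrir_r = hrir_bank[nearest]
    left = fftconvolve(mono_block, hrir_l, mode='same')
    right = fftconvolve(mono_block, hrir_r, mode='same')
    return np.stack([left, right], axis=1)  # (samples, 2) stereo block

# Toy two-direction HRIR bank; random impulse responses stand in for measured data.
rng = np.random.default_rng(0)
bank = {0: (rng.standard_normal(256), rng.standard_normal(256)),
        90: (rng.standard_normal(256), rng.standard_normal(256))}
stereo = render_binaural_block(rng.standard_normal(48000), azimuth_deg=75.0, hrir_bank=bank)
```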
Speaker diarization systems aim to find "who spoke when?" in multi-speaker recordings. The datasets usually consist of meetings, TV/talk shows, telephone, and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through an audio-visual synchronization model. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the corresponding audio. For that purpose, short video segments comprising face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two-stream network that matches audio frames with their respective visual input segments. On the basis of high-confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker-specific clusters with high probability. We tested our approach on a popular subset of the AMI meeting corpus consisting of 5.4 h of audio recordings and 5.8 h of a different set of multimodal recordings. A significant improvement in terms of diarization error rate (DER) is observed with the proposed method when compared to conventional and fully supervised audio-based speaker diarization. The results of the proposed technique are very close to those of complex state-of-the-art multimodal diarization systems, which shows the significance of such a simple yet effective technique.
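The clustering stage described above can be pictured roughly as follows: audio frames falling inside video segments with high audio-visual synchronization confidence (assumed to be produced upstream by the pre-trained sync model and face detector) seed one GMM per visible speaker, and the remaining frames are then assigned to whichever speaker model scores them highest. Feature type, GMM size, and the seeding convention are assumptions for illustration.

```python
# Sketch of seeding speaker-specific GMMs from high-confidence sync segments.
import numpy as np
from sklearn.mixture import GaussianMixture

def diarize(frame_features: np.ndarray,   # (n_frames, dim), e.g. MFCC vectors
            seed_labels: np.ndarray,      # speaker id per frame, -1 = unknown
            n_components: int = 8) -> np.ndarray:
    speakers = sorted(s for s in np.unique(seed_labels) if s >= 0)
    # Train one GMM per speaker on its high-confidence frames.
    gmms = {s: GaussianMixture(n_components=n_components, covariance_type='diag',
                               random_state=0).fit(frame_features[seed_labels == s])
            for s in speakers}
    # Score every frame under every speaker model and take the arg-max.
    scores = np.stack([gmms[s].score_samples(frame_features) for s in speakers], axis=1)
    return np.array(speakers)[scores.argmax(axis=1)]

# Toy usage: 2 speakers, 20-dim features; only part of each speaker's frames is seeded.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (150, 20)), rng.normal(3, 1, (150, 20))])
seeds = np.full(300, -1)
seeds[:100] = 0
seeds[150:250] = 1
print(diarize(feats, seeds, n_components=2))
```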
The Khokhlov–Zabolotskaya–Kuznetsov (KZK) equation has been widely used in the simulation and calculation of nonlinear sound fields. However, the accuracy of the KZK equation is reduced by the deflection of the sound beam under obliquely incident conditions. In this paper, an equivalent sound source model is proposed to make the computation direction of the KZK model consistent with the sound propagation direction after acoustic refraction, so as to improve the accuracy of sound field calculation under oblique incidence. Theoretical analysis and a pool experiment verify the feasibility and effectiveness of the proposed method.
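For reference, a commonly cited parabolic form of the KZK equation is given below; the notation is the standard one (p acoustic pressure, z beam-axis coordinate, tau the retarded time t - z/c0, transverse Laplacian, diffusivity delta, nonlinearity coefficient beta, ambient density rho0, small-signal sound speed c0) and may differ from the symbols used in the paper itself.

```latex
% Common parabolic form of the KZK equation (standard notation, assumed here):
% diffraction term, thermoviscous absorption term, and quadratic nonlinearity term.
\[
  \frac{\partial^{2} p}{\partial z\,\partial \tau}
  = \frac{c_{0}}{2}\,\nabla_{\!\perp}^{2} p
  + \frac{\delta}{2 c_{0}^{3}}\,\frac{\partial^{3} p}{\partial \tau^{3}}
  + \frac{\beta}{2 \rho_{0} c_{0}^{3}}\,\frac{\partial^{2} p^{2}}{\partial \tau^{2}},
  \qquad \tau = t - \frac{z}{c_{0}}
\]
```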